Dataset summary: Dengue - grouped by patient
Report generated using dataprep.
Dataset Statistics
| Number of Variables | 14 |
|---|---|
| Number of Rows | 15036 |
| Missing Cells | 0 |
| Missing Cells (%) | 0.0% |
| Duplicate Rows | 32 |
| Duplicate Rows (%) | 0.2% |
| Total Size in Memory | 3.4 MB |
| Average Row Size in Memory | 235.3 B |
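The headline figures above can be approximated directly with pandas. The following is a minimal sketch, not dataprep's own implementation: `dataset_overview` is a hypothetical helper, and dataprep may count duplicates or memory slightly differently.

```python
import pandas as pd


def dataset_overview(df: pd.DataFrame) -> dict:
    """Approximate the 'Dataset Statistics' table with plain pandas (hypothetical helper)."""
    n_rows, n_cols = df.shape
    n_cells = n_rows * n_cols
    missing = int(df.isna().sum().sum())
    # duplicated() marks every repeat of an earlier row; the first copy is not counted.
    duplicates = int(df.duplicated().sum())
    # deep=True measures the real size of Python strings held in object columns.
    total_bytes = int(df.memory_usage(deep=True).sum())
    return {
        "variables": n_cols,
        "rows": n_rows,
        "missing_cells": missing,
        "missing_cells_pct": 100 * missing / n_cells if n_cells else 0.0,
        "duplicate_rows": duplicates,
        "duplicate_rows_pct": 100 * duplicates / n_rows if n_rows else 0.0,
        "total_bytes": total_bytes,
        "avg_row_bytes": total_bytes / n_rows if n_rows else 0.0,
    }


# Toy data for illustration only; these are not rows from the dengue dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "age": [5, 9, 5, None],
})
overview = dataset_overview(toy)
print(overview)
```

On the toy frame this reports one missing cell and one duplicate row, because the third row repeats the first.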
Variable Types
| Categorical | 9 |
|---|---|
| Numerical | 5 |
dsource
categorical
| Distinct Count | 10 |
|---|---|
| Unique (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 3.1392 |
|---|---|
| Standard Deviation | 0.995 |
| Median | 4 |
| Minimum | 2 |
| Maximum | 5 |
Sample
| 1st row | 01nva |
|---|---|
| 2nd row | 01nva |
| 3rd row | 01nva |
| 4th row | 01nva |
| 5th row | 01nva |
Letter
| Count | 30071 |
|---|---|
| Lowercase Letter | 30071 |
| Space Separator | 0 |
| Uppercase Letter | 0 |
| Dash Punctuation | 0 |
| Decimal Number | 17130 |
age
numerical
| Distinct Count | 53 |
|---|---|
| Unique (%) | 0.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.1 MB |
| Mean | 8.4057 |
| Minimum | 0 |
| Maximum | 18 |
| Zeros | 4 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 0 |
|---|---|
| 5-th Percentile | 2 |
| Q1 | 5 |
| Median | 9 |
| Q3 | 12 |
| 95-th Percentile | 14 |
| Maximum | 18 |
| Range | 18 |
| IQR | 7 |
Descriptive Statistics
| Mean | 8.4057 |
|---|---|
| Standard Deviation | 3.9748 |
| Variance | 15.7993 |
| Sum | 126388.53 |
| Skewness | -0.06325 |
| Kurtosis | -0.8416 |
| Coefficient of Variation | 0.4729 |
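Several rows in these tables are derived from the others: the range is the maximum minus the minimum, the IQR is Q3 minus Q1, the variance is the standard deviation squared, and the coefficient of variation is the standard deviation divided by the mean. A short check using the values reported for `age` (the variable names are only illustrative):

```python
# Values copied from the `age` tables above.
minimum, maximum = 0, 18
q1, q3 = 5, 12
mean, std = 8.4057, 3.9748

value_range = maximum - minimum    # range
iqr = q3 - q1                      # interquartile range
variance = std ** 2                # variance from the standard deviation
coeff_variation = std / mean       # coefficient of variation

print(value_range, iqr, round(variance, 3), round(coeff_variation, 4))
```

The results agree with the reported range (18), IQR (7), variance (15.7993) and coefficient of variation (0.4729) up to rounding of the inputs.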
gender
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.867 |
|---|---|
| Standard Deviation | 0.9911 |
| Median | 4 |
| Minimum | 4 |
| Maximum | 6 |
Sample
| 1st row | Male |
|---|---|
| 2nd row | Female |
| 3rd row | Female |
| 4th row | Male |
| 5th row | Female |
Letter
| Count | 73180 |
|---|---|
| Lowercase Letter | 58144 |
| Space Separator | 0 |
| Uppercase Letter | 15036 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
weight
numerical
| Distinct Count | 355 |
|---|---|
| Unique (%) | 2.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.1 MB |
| Mean | 28.7932 |
| Minimum | 7.2 |
| Maximum | 114 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 7.2 |
|---|---|
| 5-th Percentile | 12 |
| Q1 | 19 |
| Median | 26.5 |
| Q3 | 37 |
| 95-th Percentile | 52 |
| Maximum | 114 |
| Range | 106.8 |
| IQR | 18 |
Descriptive Statistics
| Mean | 28.7932 |
|---|---|
| Standard Deviation | 12.8574 |
| Variance | 165.3136 |
| Sum | 432935.1 |
| Skewness | 0.8498 |
| Kurtosis | 0.8271 |
| Coefficient of Variation | 0.4465 |
bleeding
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.7447 |
|---|---|
| Standard Deviation | 0.4361 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | True |
|---|---|
| 2nd row | False |
| 3rd row | True |
| 4th row | False |
| 5th row | False |
Letter
| Count | 71341 |
|---|---|
| Lowercase Letter | 56305 |
| Space Separator | 0 |
| Uppercase Letter | 15036 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
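For two-valued text columns, the length and letter statistics encode the class balance. `True` has 4 characters and `False` has 5, so with 15036 rows a total of 71341 characters implies 71341 - 4 × 15036 = 11197 `False` values. A minimal sketch of that arithmetic follows; `count_longer_value` is a hypothetical helper, and it assumes the column holds exactly two distinct strings of different lengths.

```python
def count_longer_value(total_chars: int, n_rows: int, short_len: int, long_len: int) -> int:
    """Rows holding the longer of two strings, given the column's total character count."""
    extra = total_chars - short_len * n_rows
    per_row = long_len - short_len
    if per_row <= 0 or extra < 0 or extra % per_row != 0:
        raise ValueError("inputs are not consistent with a two-valued column")
    return extra // per_row


N_ROWS = 15036  # number of rows reported in the dataset statistics

# bleeding: "True" (4 characters) vs "False" (5 characters), 71341 characters in total
n_false = count_longer_value(71341, N_ROWS, 4, 5)
n_true = N_ROWS - n_false
print(n_true, n_false, round(n_true / N_ROWS, 4))

# gender: "Male" (4 characters) vs "Female" (6 characters), 73180 characters in total
n_female = count_longer_value(73180, N_ROWS, 4, 6)
n_male = N_ROWS - n_female
print(n_male, n_female)
```

This gives 3839 `True` and 11197 `False` values for `bleeding` (about 25.5% positive), consistent with the reported mean length of 4.7447. The same reasoning applies to the other boolean columns.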
plt
numerical
| Distinct Count | 1453 |
|---|---|
| Unique (%) | 9.7% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.1 MB |
| Mean | 1645.5563 |
| Minimum | 3 |
| Maximum | 152152 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 3 |
|---|---|
| 5-th Percentile | 24 |
| Q1 | 74 |
| Median | 175 |
| Q3 | 251 |
| 95-th Percentile | 405.25 |
| Maximum | 152152 |
| Range | 152149 |
| IQR | 177 |
Descriptive Statistics
| Mean | 1645.5563 |
|---|---|
| Standard Deviation | 8719.5556 |
| Variance | 7.6031e+07 |
| Sum | 2.4743e+07 |
| Skewness | 7.1895 |
| Kurtosis | 60.1581 |
| Coefficient of Variation | 5.2988 |
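The mean of `plt` (1645.56) is well above its 95th percentile (405.25), so a small number of very large values dominate the mean, standard deviation and skewness. One common way to flag such values is Tukey's rule, which marks anything more than 1.5 × IQR beyond the quartiles. The sketch below applies it to the reported quartiles; the rule, the helper name `tukey_fences` and the toy series are illustrative choices, not part of the dataprep report.

```python
from typing import Tuple

import pandas as pd


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper fences at k * IQR beyond the quartiles (hypothetical helper)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles reported for `plt` above.
lower, upper = tukey_fences(74, 251)
print(lower, upper)

# Toy values for illustration only; the last one is the reported maximum.
toy = pd.Series([24, 74, 175, 251, 405, 152152])
flagged = toy[(toy < lower) | (toy > upper)]
print(flagged.tolist())
```

With the reported quartiles the upper fence is 516.5, so the reported maximum of 152152 lies far outside it. Whether such values are errors or genuine measurements cannot be decided from the report alone.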
shock
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.9524 |
|---|---|
| Standard Deviation | 0.213 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | True |
|---|---|
| 2nd row | True |
| 3rd row | True |
| 4th row | True |
| 5th row | True |
Letter
| Count | 74464 |
|---|---|
| Lowercase Letter | 59428 |
| Space Separator | 0 |
| Uppercase Letter | 15036 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
haematocrit_percent
numerical
| Distinct Count | 564 |
|---|---|
| Unique (%) | 3.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.1 MB |
| Mean | 41.4304 |
| Minimum | 21 |
| Maximum | 67.05 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 21 |
|---|---|
| 5-th Percentile | 33.6 |
| Q1 | 37.3 |
| Median | 40.5 |
| Q3 | 45 |
| 95-th Percentile | 52 |
| Maximum | 67.05 |
| Range | 46.05 |
| IQR | 7.7 |
Descriptive Statistics
| Mean | 41.4304 |
|---|---|
| Standard Deviation | 5.6391 |
| Variance | 31.7989 |
| Sum | 622947.5593 |
| Skewness | 0.6076 |
| Kurtosis | 0.1263 |
| Coefficient of Variation | 0.1361 |
bleeding_gum
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.8942 |
|---|---|
| Standard Deviation | 0.3076 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | True |
|---|---|
| 2nd row | False |
| 3rd row | True |
| 4th row | False |
| 5th row | False |
Letter
| Count | 73589 |
|---|---|
| Lowercase Letter | 58553 |
| Space Separator | 0 |
| Uppercase Letter | 15036 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
abdominal_pain
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.6887 |
|---|---|
| Standard Deviation | 0.463 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | True |
|---|---|
| 2nd row | True |
| 3rd row | True |
| 4th row | True |
| 5th row | True |
Letter
| Count | 70500 |
|---|---|
| Lowercase Letter | 55464 |
| Space Separator | 0 |
| Uppercase Letter | 15036 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
ascites
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.845 |
|---|---|
| Standard Deviation | 0.3619 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | False |
|---|---|
| 2nd row | False |
| 3rd row | False |
| 4th row | False |
| 5th row | False |
Letter
| Count | 72849 |
|---|---|
| Lowercase Letter | 57813 |
| Space Separator | 0 |
| Uppercase Letter | 15036 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
bleeding_mucosal
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.8212 |
|---|---|
| Standard Deviation | 0.3832 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | False |
|---|---|
| 2nd row | False |
| 3rd row | True |
| 4th row | False |
| 5th row | False |
Letter
| Count | 72492 |
|---|---|
| Lowercase Letter | 57456 |
| Space Separator | 0 |
| Uppercase Letter | 15036 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
bleeding_skin
categorical
| Distinct Count | 2 |
|---|---|
| Unique (%) | 0.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 1.8 MB |
Length
| Mean | 4.523 |
|---|---|
| Standard Deviation | 0.4995 |
| Median | 5 |
| Minimum | 4 |
| Maximum | 5 |
Sample
| 1st row | False |
|---|---|
| 2nd row | False |
| 3rd row | True |
| 4th row | False |
| 5th row | False |
Letter
| Count | 68008 |
|---|---|
| Lowercase Letter | 52972 |
| Space Separator | 0 |
| Uppercase Letter | 15036 |
| Dash Punctuation | 0 |
| Decimal Number | 0 |
body_temperature
numerical
| Distinct Count | 1220 |
|---|---|
| Unique (%) | 8.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 1.1 MB |
| Mean | 37.8766 |
| Minimum | 35 |
| Maximum | 41.5 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
Quantile Statistics
| Minimum | 35 |
|---|---|
| 5-th Percentile | 37 |
| Q1 | 37.2143 |
| Median | 37.6333 |
| Q3 | 38.5 |
| 95-th Percentile | 39.5 |
| Maximum | 41.5 |
| Range | 6.5 |
| IQR | 1.2857 |
Descriptive Statistics
| Mean | 37.8766 |
|---|---|
| Standard Deviation | 0.8523 |
| Variance | 0.7265 |
| Sum | 569512.4822 |
| Skewness | 0.8577 |
| Kurtosis | 0.04737 |
| Coefficient of Variation | 0.0225 |
import pandas as pd
from dataprep.eda import create_report
from pkgname.utils.data_loader import load_dengue
from pkgname.utils.print_utils import suppress_stdout, suppress_stderr

features = ["dsource", "age", "gender", "weight", "bleeding", "plt",
            "shock", "haematocrit_percent", "bleeding_gum", "abdominal_pain",
            "ascites", "bleeding_mucosal", "bleeding_skin", "body_temperature"]

# Load only the required columns. The context managers are separated by a
# comma so that both are entered ("a and b" would enter only one of them).
with suppress_stdout(), suppress_stderr():
    df = load_dengue(usecols=['study_no'] + features)

# Forward-fill within each patient. The trailing bfill() then runs on the
# whole column rather than per patient.
for feat in features:
    df[feat] = df.groupby('study_no')[feat].ffill().bfill()

# Keep patients aged 18 or younger and drop rows that are still incomplete.
df = df.loc[df['age'] <= 18]
df = df.dropna()

# Collapse the records to one row per patient.
# "mean" replaces np.mean: same result, no deprecation warning in recent pandas.
df = df.groupby(by="study_no", dropna=False).agg(
    dsource=pd.NamedAgg(column="dsource", aggfunc="last"),
    age=pd.NamedAgg(column="age", aggfunc="max"),
    gender=pd.NamedAgg(column="gender", aggfunc="first"),
    weight=pd.NamedAgg(column="weight", aggfunc="mean"),
    bleeding=pd.NamedAgg(column="bleeding", aggfunc="max"),
    plt=pd.NamedAgg(column="plt", aggfunc="min"),
    shock=pd.NamedAgg(column="shock", aggfunc="max"),
    haematocrit_percent=pd.NamedAgg(column="haematocrit_percent", aggfunc="max"),
    bleeding_gum=pd.NamedAgg(column="bleeding_gum", aggfunc="max"),
    abdominal_pain=pd.NamedAgg(column="abdominal_pain", aggfunc="max"),
    ascites=pd.NamedAgg(column="ascites", aggfunc="max"),
    bleeding_mucosal=pd.NamedAgg(column="bleeding_mucosal", aggfunc="max"),
    bleeding_skin=pd.NamedAgg(column="bleeding_skin", aggfunc="max"),
    body_temperature=pd.NamedAgg(column="body_temperature", aggfunc="mean"),
).dropna()

# Build the report shown above.
report = create_report(df, title="Dengue dataset report")
report
Total running time of the script: (0 minutes 5.352 seconds)
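One detail of the fill step in the script is worth checking on your own data: `groupby(...).ffill()` fills within each patient, but the chained `.bfill()` is an ordinary `Series.bfill()` and runs over the whole column, so a patient with no recorded value can inherit one from the next patient. A minimal sketch on toy data (not the dengue dataset) compares this with a per-patient fill via `transform`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "study_no": ["a", "a", "b", "b", "c", "c"],
    "weight": [np.nan, 10.0, np.nan, np.nan, 30.0, np.nan],
})

# Pattern used in the script: per-group forward fill, then a column-wide backward fill.
chained = toy.groupby("study_no")["weight"].ffill().bfill()

# Alternative: forward and backward fill both restricted to each patient.
per_patient = toy.groupby("study_no")["weight"].transform(lambda s: s.ffill().bfill())

print(chained.tolist())
print(per_patient.tolist())
```

In the chained version patient `b`, who has no weight at all, receives 30.0 from patient `c`. In the per-patient version those rows stay missing and would be removed by a later `dropna()`.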